Movie Predictor v2¶
Project Overview¶
As of May 30, 2025, I’ve watched 119 movies this year, which is equivalent to about nine days of film-watching. As a result of this, I’ve noticed my tolerance for movies I don’t enjoy has significantly decreased. This project uses my Letterboxd film-logging data to better understand my viewing preferences and build a model that predicts how much I’ll enjoy a movie on my watchlist before I watch it.
Data¶
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, make_scorer
from xgboost import XGBRegressor
import shap
watched = pd.read_csv("../data/v2/watched_2.csv")
watchlist = pd.read_csv("../data/v2/watchlist_2.csv")
Letterboxd doesn't offer API access, but they do offer the option to export user data such as logged films and lists.
I'll be using:
watched.csv - a .csv of films that I've logged
watchlist.csv - a .csv of films I'd like to see
I'll train the model on watched and attempt to predict my Letterboxd-equivalent rating for each film on my watchlist.
watched.iloc[20][['Date', 'Name', 'Year', 'Letterboxd URI']]
Date                          2025-05-30
Name               Dog Day Afternoon
Year                            1975
Letterboxd URI  https://boxd.it/29bg
Name: 20, dtype: object
This was the raw data format of the watched .csv Letterboxd provides, which unfortunately is very limited. I was surprised to find that they didn't provide me with my own ratings for logged films. In hindsight there's probably a way to scrape it, but at the time my solution was to write a script rater.py that prompts the user to manually rate each of the films on the list. Each rating is appended as myRating to the watched dataframe.
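rater.py itself isn't shown in this notebook; the following is a minimal sketch of what it does, where the prompt text and input validation are my reconstruction rather than the actual script:

```python
import pandas as pd

def rate_films(watched: pd.DataFrame) -> pd.DataFrame:
    """Prompt for a half-star rating for each logged film and append it as myRating."""
    ratings = []
    for _, row in watched.iterrows():
        while True:
            raw = input(f"Rate '{row['Name']}' ({row['Year']}) [0.5-5.0]: ")
            try:
                rating = float(raw)
            except ValueError:
                continue  # not a number, ask again
            # Letterboxd only allows half-star increments between 0.5 and 5.0
            if 0.5 <= rating <= 5.0 and rating * 2 == int(rating * 2):
                ratings.append(rating)
                break
    out = watched.copy()
    out['myRating'] = ratings
    return out
```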
Scraping¶
The next step was to scrape the remaining features of each movie from the given Letterboxd URL, which was the most difficult and time consuming part. The scraped features are:
Director - Director of the film
Runtime - Length of the film in minutes
Rating - Letterboxd community rating of each film
Genres - Genre(s) of each film
Themes - Subgenres / Themes that characterize the film
Scraping director names was simple because they live in basic HTML tags, but the remaining features were rendered with JavaScript elements I wasn't familiar with. I spent the entire first day of the project learning through trial and error how to scrape each feature individually, since no two features could be retrieved the same way.
Afterwards, I found that 'Error 429: Too Many Requests' comprised about half of my fields, so I had to update the scraper to continuously retry those films until the request timing allowed the data to be scraped. Because of this, the entire process takes 2-3 hours to scrape ~300 movies, which is something I'll look to improve in the future, as it made experimenting with different sets of films very tedious.
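The retry logic can be sketched roughly like this. This is a simplified illustration using requests; the actual scraper's structure, delays, and parsing are not shown here:

```python
import time
import requests

def fetch_with_retry(url, max_retries=8, base_delay=5):
    """GET a page, retrying whenever the site answers with HTTP 429."""
    for attempt in range(max_retries):
        resp = requests.get(url)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.text
        # Honor the Retry-After header if present; otherwise back off linearly
        wait = int(resp.headers.get('Retry-After', base_delay * (attempt + 1)))
        time.sleep(wait)
    raise RuntimeError(f"Still rate-limited after {max_retries} retries: {url}")
```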
I was unable to scrape a few fields that I wanted:
Watches - Number of users who've watched the film
Likes - Number of users who've 'hearted' the film, a designation of enjoyment separate from the five-star rating scale.
Fans - Number of fans of the film. A 'fan' is someone who has the film in their Top 4 favorites on their user profile.
Total_ratings - Total number of users who contributed to the Rating variable
Fortunately, I found a public Kaggle dataset of scraped Letterboxd metadata that includes these features for the majority of the films in my collection. I merged these features into my original watched and watchlist dataframes, then manually entered the data for the 15 or so films for which there was none.
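The merge itself is a standard pandas left-join. A sketch, assuming the films match on title and year (the actual join keys and the Kaggle dataset's column names may differ):

```python
import pandas as pd

def merge_metadata(films: pd.DataFrame, kaggle: pd.DataFrame) -> pd.DataFrame:
    """Left-join Kaggle engagement stats onto a film list by title + year."""
    extra = ['Name', 'Year', 'Watches', 'Likes', 'Fans', 'Total_ratings']
    merged = films.merge(kaggle[extra], on=['Name', 'Year'], how='left')
    # Rows with NaN in Watches are the films that need manual entry
    print(f"{merged['Watches'].isna().sum()} films need manual entry")
    return merged
```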
Cleaning¶
The next step was converting the Genres and Themes into a format suitable for modeling using one-hot encoding. Initially, genres were scraped into a single string format like "Drama, Comedy, Adventure", which isn't a decipherable format for the model. To remedy this, each genre is presented as a unique column where a 1 is indicative that the film includes that genre and 0 indicates that it does not.
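A sketch of that encoding step using pandas' built-in string dummies (the notebook's actual implementation may differ, but the result is the same 0/1 column layout):

```python
import pandas as pd

def one_hot_genres(df: pd.DataFrame) -> pd.DataFrame:
    """Expand a 'Drama, Comedy, Adventure' style Genres string into 0/1 columns."""
    dummies = df['Genres'].str.get_dummies(sep=', ')
    return pd.concat([df, dummies], axis=1)
```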
watched.head(2)
| Date | Name | Year | Letterboxd URI | myRating | Director | Runtime | Rating | Genres | Themes | ... | Romance | Science Fiction | Thriller | TV Movie | War | Western | Watches | Likes | Fans | Total_ratings | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-05-30 | The General | 1926 | https://boxd.it/29co | 1.5 | Clyde Bruckman | 79.0 | 4.20 | Comedy, Action, Adventure, War | Crude humor and satire, Epic heroes, Amusing j... | ... | 0 | 0 | 0 | 0 | 1 | 0 | 138036.0 | 39282.0 | 510.0 | 88322.0 |
| 1 | 2025-05-30 | Duck Soup | 1933 | https://boxd.it/25PG | 2.0 | Leo McCarey | 69.0 | 3.93 | War, Comedy | Crude humor and satire, Song and dance, Amusin... | ... | 0 | 0 | 0 | 0 | 1 | 0 | 94004.0 | 24845.0 | 634.0 | 55182.0 |
2 rows × 33 columns
Now the data is in a format that I can model. Date, Letterboxd URI, Director, Genres and Themes are not features that the model uses, and will be removed later.
A note on using Themes as a feature:¶
Letterboxd Themes overlap pretty heavily with Genres, and many of them are redundant. I've found that RateYourMusic.com's film descriptors seem like they would be more accurate by comparison.
Using the movie Apocalypse Now as an example:
Letterboxd:
Genres: "Drama", "War"
Themes: 'War and historical adventure', 'Humanity and the world around us', 'Politics and human rights', 'Epic history and literature', 'Intense violence and transgression', 'Military combat and heroic soldiers', 'Surreal and thought-provoking visions of life and death', 'Epic adventure and breathtaking battles', 'Bravery in War', 'Dreamlike, quirky, and surreal story'
RYM:
Genres: "War", "Psychological Drama", "New Hollywood"
Themes: "Adventure", "Psychological Thriller", "Epic"
Letterboxd Genre "War", and Themes 'War and historical adventure', 'Military combat and heroic soldiers', and 'Bravery in War' all share a substantial degree of overlap here. On the other hand, some of the other categories are so all-encompassing that they don't accurately describe the film. Apocalypse Now is in no way a 'quirky' movie despite the 'Dreamlike sequences' in it.
An earlier version of the model attempts to incorporate Themes, but for the reasons I've described I didn't find them to be a productive input. I would be interested in reintroducing them in the form of RYM's descriptors, but unfortunately they have very strict anti-scraping measures in place which as far as I can tell make it impossible to retrieve this data by any method other than manual entry. I've requested API access, but in the meantime I've come to the conclusion that it's better not to incorporate Themes in their current format.
Exploratory Data Analysis¶
# DISTRIBUTION OF RATINGS (MINE AND LETTERBOXD AVERAGES)
# Side-by-side [1x2]
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
sns.histplot(watched['myRating'], color='blue', alpha=0.5, bins=10, label='myRating', ax=axes[0])
sns.histplot(watched['Rating'], color='gold', alpha=0.5, bins=10, label='LetterboxdRating', ax=axes[0])
axes[0].set_xlabel('Rating')
axes[0].set_ylabel('Count')
axes[0].set_title('My Rating vs Letterboxd Rating')
axes[0].legend()
# KDE VERSION
sns.kdeplot(watched['myRating'], fill=True, color='blue', label='myRating', ax=axes[1])
sns.kdeplot(watched['Rating'], fill=True, color='gold', label='LetterboxdRating', ax=axes[1])
axes[1].set_xlabel('Rating')
axes[1].set_ylabel('Density')
axes[1].set_title('(KDE) myRating vs LetterboxdRating')
axes[1].legend()
plt.tight_layout()
plt.show()
print(watched[['myRating', 'Rating']].describe())
          myRating      Rating
count   255.000000  255.000000
mean      3.382353    3.801725
std       1.064773    0.470008
min       0.500000    2.040000
25%       3.000000    3.585000
50%       3.500000    3.860000
75%       4.000000    4.160000
max       5.000000    4.630000
The above plots show the distribution of how I rate films compared to how the Letterboxd community rates the same films. One thing to note is that my rating scale is limited to 0.5-point intervals due to user rating constraints, so my ratings aren't continuous the way Letterboxd's are, since their scores are averages across many users. As a result, Letterboxd's ratings only range between 2.04 and 4.63, whereas my ratings use the full spectrum of 0.5 - 5.
This is something to consider when evaluating the model's performance. If a film receives a predicted rating of 3.61 then perhaps it should round down to a 3.5 so that it's functionally more similar to myRating.
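A small helper for that idea, assuming snapping to the nearest half star is the desired behavior:

```python
import numpy as np

def to_half_stars(pred: float) -> float:
    """Snap a continuous prediction onto the 0.5-5.0 half-star scale."""
    return float(np.clip(np.round(pred * 2) / 2, 0.5, 5.0))

print(to_half_stars(3.61))  # 3.5
```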
# Temporary Feature - Note: This must be removed before modeling because Watchlist cannot use this information since there's no myRating
watched['rating_diff'] = watched['myRating'] - watched['Rating']
# Set fixed min/max for consistent coloring
vmin, vmax = -2.5, 2.5 # adjust based on actual max range
# 5 pt color scale
custom_colorscale = [
[0.0, 'purple'], # strong negative diff
[0.25, 'teal'], # mild negative
[0.5, 'green'], # agreement
[0.75, 'teal'], # mild positive
[1.0, 'purple'] # strong positive diff
]
fig = px.scatter(
watched,
x='Rating',
y='myRating',
color='rating_diff',
color_continuous_scale=custom_colorscale,
range_color=[vmin, vmax],
color_continuous_midpoint=0,
hover_name='Name',
title='My Ratings vs. Letterboxd Ratings (Green = Agreement, Purple = Disagreement)',
labels={
'Rating': 'Letterboxd Rating',
'myRating': 'My Rating',
'rating_diff': 'My Rating - LB Rating'
},
opacity=0.7
)
fig.add_shape(
type='path',
path='M 0 0 L 5 0 L 5 5 Z', # bottom-right triangle
fillcolor='rgba(255, 0, 0, 0.05)',
line=dict(width=0),
layer='below'
)
fig.add_shape(
type='path',
path='M 0 0 L 0 5 L 5 5 Z', # top-left triangle
fillcolor='rgba(0, 255, 0, 0.05)',
line=dict(width=0),
layer='below'
)
# Reference line y = x. Films above it I like more than Letterboxd does; films below it, less.
fig.add_shape(
type='line',
x0=0, x1=5, y0=0, y1=5,
line=dict(color='gray', dash='dash')
)
fig.update_layout(
width=700,
height=700,
xaxis=dict(range=[0, 5], constrain='domain', scaleanchor='y', scaleratio=1, gridcolor='lightgray'),
yaxis=dict(range=[0, 5], gridcolor='lightgray'),
template='plotly_white',
margin=dict(l=50, r=50, t=60, b=50),
coloraxis_colorbar=dict(title='Rating Difference')
)
# Dot size and black outline in one pass
fig.update_traces(
    marker=dict(
        size=10,
        line=dict(width=1, color='black')
    )
)
fig.show()
It appears that myRating falls below Rating more often than above it. To verify:
overPt5 = (watched['rating_diff'] > 0.5).sum()
underNegPt5 = (watched['rating_diff'] < -0.5).sum()
withinPt5 = ((watched['rating_diff'] >= -0.5) & (watched['rating_diff'] <= 0.5)).sum()
print(f"Count where rating_diff > 0.5: {overPt5}")
print(f"Count where rating_diff within 0.5 stars: {withinPt5}")
print(f"Count where rating_diff < -0.5: {underNegPt5}")
watched[['myRating', 'Rating']].corr()
Count where rating_diff > 0.5: 42
Count where rating_diff within 0.5 stars: 112
Count where rating_diff < -0.5: 101
| myRating | Rating | |
|---|---|---|
| myRating | 1.00000 | 0.40339 |
| Rating | 0.40339 | 1.00000 |
Most of the ratings are within 0.5 stars, but it is true that significantly more films fall below the Letterboxd Rating (101) than above it (42). That being said, there's still a positive correlation between the two values.
# Note to self: there are Year and Runtime charts that can be incorporated here but neither are very interesting visually at this point
genres = ['Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family',
'Fantasy', 'History', 'Horror', 'Music', 'Mystery', 'Romance', 'Science Fiction',
'Thriller', 'TV Movie', 'War', 'Western']
# Calculate average myRating per genre
genre_avg = {}
for genre in genres:
avg_rating = watched.loc[watched[genre] == 1, 'myRating'].mean()
genre_avg[genre] = avg_rating
# Convert to df for plotting
genre_avg_df = pd.DataFrame.from_dict(genre_avg, orient='index', columns=['Average_myRating'])
genre_avg_df = genre_avg_df.sort_values('Average_myRating', ascending=False)
# ---------------------PLOTTING
# Calculate counts per genre
genre_counts = watched[genres].sum()
# Sort genres by count
sorted_genres = genre_counts.sort_values(ascending=False).index
# Reorder the average rating df to match this order
genre_avg_sorted = genre_avg_df.loc[sorted_genres]
# Reorder counts to match the sorted genres
genre_counts_sorted = genre_counts[sorted_genres]
fig, ax1 = plt.subplots(figsize=(12, 6))
# Bar plot for average rating
sns.barplot(x=genre_avg_sorted.index, y='Average_myRating', data=genre_avg_sorted, ax=ax1, hue=genre_avg_sorted.index, palette='magma', legend=False)
ax1.set_ylabel('Average myRating')
ax1.set_xlabel('Genre')
ax1.tick_params(axis='x', rotation=45)
# Secondary axis for counts
ax2 = ax1.twinx()
ax2.plot(genre_counts_sorted, color='red', marker='o', linestyle='-', linewidth=2)
ax2.set_ylabel('Count of Movies', color='red')
ax2.tick_params(axis='y', labelcolor='red')
plt.title('Average myRating and Movie Counts by Genre (Sorted by Count)')
plt.tight_layout()
plt.show()
My mean film rating is 3.38, which reflects similarly across most genres. I included a secondary axis to showcase the counts of each genre as they appear in Watched, since several of the less-viewed ones (Music, Documentary, Family, and TV Movie) appear to be favored more than they should be due to lack of representation. The important takeaway is that Science Fiction has the lowest average, which is accurate to my taste and hopefully influences the model accordingly.
Correlation¶
# Select only numerical columns to avoid errors
numeric_cols = watched.select_dtypes(include='number')
corr_matrix = numeric_cols.corr()
corr = watched.select_dtypes(include='number').corr()
# Filter to show only correlations above threshold
threshold = 0.5
mask = (abs(corr) >= threshold) & (corr != 1.0)
# Keep only rows/columns that have any strong correlation
filtered = corr.loc[mask.any(), mask.any()]
plt.figure(figsize=(10, 8))
sns.heatmap(filtered, annot=True, cmap='coolwarm', center=0)
plt.title("Filtered Correlation Matrix (|r| ≥ 0.5)")
plt.show()
rating_diff is a temporary feature introduced during EDA for graphing purposes. As expected, it's highly correlated with myRating because it's derived from it. This collinearity isn't a problem: the feature will be removed before modeling, since it depends on the target variable and therefore can't exist for Watchlist films.
Beyond that, the four variables introduced via the Kaggle dataset all seem to have a high degree of collinearity as well. I'll address this later through feature engineering.
Below is a list of how each feature correlates (with absolute value for sorting) with the target variable myRating:
# Compute correlation with target 'myRating', then sort descending
corr_with_target = numeric_cols.corr()['myRating'].drop('myRating')
ranked_corr = corr_with_target.abs().sort_values(ascending=False)
print(ranked_corr)
rating_diff        0.897490
Rating             0.403390
Science Fiction    0.301138
Adventure          0.167277
Likes              0.166301
Fans               0.144662
Fantasy            0.143420
Total_ratings      0.126871
Watches            0.124441
Horror             0.093094
Music              0.090222
Documentary        0.088221
Action             0.084375
Year               0.084359
Crime              0.079276
Drama              0.078796
History            0.058523
Romance            0.044717
Family             0.036469
TV Movie           0.036469
War                0.029892
Thriller           0.024424
Western            0.022145
Comedy             0.018246
Runtime            0.007015
Mystery            0.004089
Animation               NaN
Name: myRating, dtype: float64
After the removal of rating_diff, Rating and Science Fiction are the strongest correlations. Runtime and Mystery have the weakest relationships to myRating and will likely be removed from the model. Runtime came as a surprise; I would've guessed a stronger correlation there. Science Fiction has the strongest correlation of the genres, which I expected, but I may need to address genres like Documentary ranking higher than they should (relative to other genres; its correlation isn't that strong overall) despite low watch counts. For now, it's something to keep an eye on. Hopefully feature engineering will produce more strongly related features.
# Showcase other correlated features
corr_matrix = numeric_cols.corr().abs()
corr_pairs = corr_matrix.unstack().reset_index()
corr_pairs.columns = ['Feature1', 'Feature2', 'Correlation']
# Remove self correlations
corr_pairs = corr_pairs[corr_pairs['Feature1'] != corr_pairs['Feature2']]
# Drop duplicate pairs regardless of order
corr_pairs['sorted_pair'] = corr_pairs.apply(lambda row: '-'.join(sorted([row['Feature1'], row['Feature2']])), axis=1)
corr_pairs = corr_pairs.drop_duplicates('sorted_pair')
# Sort descending by correlation strength
corr_pairs = corr_pairs.sort_values(by='Correlation', ascending=False).drop(columns='sorted_pair')
print('Top 10 Correlated Feature Pairs')
print(corr_pairs.head(10))
Top 10 Correlated Feature Pairs
Feature1 Feature2 Correlation
670 Watches Total_ratings 0.992505
698 Likes Total_ratings 0.980076
668 Watches Likes 0.972998
55 myRating rating_diff 0.897490
697 Likes Fans 0.847889
726 Fans Total_ratings 0.783080
669 Watches Fans 0.774742
272 Documentary TV Movie 0.497038
109 Rating Fans 0.422446
63 Runtime Comedy 0.417586
The correlation matrix earlier showed that the four introduced Kaggle variables (Watches, Likes, Fans, Total_ratings) were highly collinear, so it's no surprise that they comprise 6 of the top 7 strongest collinearities. All of these will be handled by feature engineering and the removal of rating_diff. The remaining pairs are not strong enough to warrant any changes.
Feature Engineering¶
The introduction of Watches, Likes, Fans, and Total_ratings (which I sometimes refer to as 'Kaggle variables') make feature engineering possible in a way that wasn't possible in previous versions of the model. I've combined them in the following ways:
LikedRatio = Likes / Watches - Represents the percentage of users who "Liked" (or "Hearted") a film after watching it. Scaling Likes in this way is more informative because it accounts for popularity. A film with more "Likes" isn't necessarily more well-liked just because more people have seen it.
FanRatio = Fans / Watches - Represents the Fans variable in the same way that LikedRatio does. Users can only be a fan of 4 films at a given time, so FanRatio is a way to recognize a film's fanbase.
MovieAge = (2025 - Year) - Gives an 'Age' in years to each movie, 2025 represents the current year. Could potentially be a more helpful way of looking at when a movie came out than Year on its own.
These features are added to both the Watched and Watchlist films:
watched['LikedRatio'] = watched['Likes']/watched['Watches']
watched['FanRatio'] = watched['Fans'] / watched['Watches']
watched["MovieAge"] = 2025 - watched["Year"]
watchlist['LikedRatio'] = watchlist['Likes']/watchlist['Watches']
watchlist['FanRatio'] = watchlist['Fans'] / watchlist['Watches']
watchlist["MovieAge"] = 2025 - watchlist["Year"]
Rating_Z_Score
Using z-scores is one of my go-to methods for evaluating player performance in fantasy sports, so naturally I was interested in applying it here as well, albeit in a slightly different manner. The idea is that a 4.2 rating should not be considered a 4.2 universally: a 4.2 with 100,000 ratings should carry more weight than a 4.2 with 1,000 ratings. By calculating the z-score of a film's Rating and scaling it by the square root of its Total_ratings, I can give greater context to the film's rating in terms of magnitude.
# ZSCORE
mean_rating = watched['Rating'].mean()
std_rating = watched['Rating'].std()
watched['Rating_Z_Score'] = ((watched['Rating'] - mean_rating) / std_rating) * np.sqrt(watched['Total_ratings'])
watchlist['Rating_Z_Score'] = ((watchlist['Rating'] - mean_rating) / std_rating) * np.sqrt(watchlist['Total_ratings'])
Genre_Avg_Score
I wanted to account for how genre combinations (or hybrids) might work together. The current binary system assumes that a romantic comedy ("Romance" and "Comedy") is as funny as a "Comedy" and features the same romantic elements as a "Romance". In my opinion this is rarely true; generally both genres are dialed back to some degree so that they can work in combination with each other. It might not be as funny as a pure comedy, nor as emotionally intense as a pure romance. In summary, a rom-com isn't the same thing as "Romance" + "Comedy": it is its own separate entity with distinct characteristics that aren't captured solely through binary genre combinations.
This feature averages my ratings for each genre in a film. If a film is tagged as both "Romance" and "Comedy", I take the mean of my average Romance rating and my average Comedy rating. This gives a rough subjective score for how I tend to respond to that combination of genres. This is more or less my solution to the Theme problem discussed above, it's not ideal but it's a band-aid for now.
# Genre Average Score
# creates dictionary to pair average myRating per genre
genre_avg_ratings = {
genre: watched.loc[watched[genre] == 1, 'myRating'].mean()
for genre in genres
}
# function to compute genre-based score for each movie:
# averages my per-genre mean rating across the movie's genres
def calculate_genre_score(row):
    active_genres = [genre for genre in genres if row[genre] == 1]
    if not active_genres:  # guard against films with no genre tags
        return np.nan
    avg_scores = [genre_avg_ratings[genre] for genre in active_genres]
    return sum(avg_scores) / len(avg_scores)
watched['Genre_Avg_Score'] = watched.apply(calculate_genre_score, axis=1)
watchlist['Genre_Avg_Score'] = watchlist.apply(calculate_genre_score, axis=1)
# Consider weighting this somehow, since genres like Documentary are highly rated but rarely watched
Now that feature engineering is done, I'll first test the new features for multicollinearity using VIF (Variance Inflation Factor) before returning to the correlation matrix method from before.
# VIF
# Under 5 = Good, 5-10 = Iffy, 10+ = Problem
new_features = ['LikedRatio', 'FanRatio', 'MovieAge', 'Rating_Z_Score', 'Genre_Avg_Score', 'Rating']
X = watched[new_features]
X = add_constant(X)
# VIF
vif_df = pd.DataFrame()
vif_df["Feature"] = X.columns
vif_df["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_df)
           Feature         VIF
0            const  488.578304
1       LikedRatio    6.026871
2         FanRatio    2.354043
3         MovieAge    1.438686
4   Rating_Z_Score    3.536108
5  Genre_Avg_Score    1.047493
6           Rating    6.875913
LikedRatio and Rating have a VIF > 5, which suggests higher collinearity than ideal. It's something to consider, but not to a degree that concerns me at the moment. Next I'll remove the weakly correlated features, along with columns used as identifiers rather than model inputs, and then look at an updated correlation matrix.
# ['Runtime', 'Mystery', 'Total_ratings', 'Likes'] can be reintroduced to test in RF
cols_to_remove = ['Runtime', 'Mystery', 'Date', 'Letterboxd URI', 'Director', 'Genres', 'Themes',
                  'rating_diff', 'Total_ratings', 'Likes', 'Fans']
watched = watched.drop(columns=cols_to_remove, errors='ignore')
#drop these in numeric_cols too so that correlation matrix is easily copy/pastable
numeric_cols = numeric_cols.drop(columns=cols_to_remove, errors='ignore')
# Select only numerical columns to avoid errors
numeric_cols = watched.select_dtypes(include='number')
corr_matrix = numeric_cols.corr()
corr = watched.select_dtypes(include='number').corr()
# Filter to show only correlations above threshold
threshold = 0.7
mask = (abs(corr) >= threshold) & (corr != 1.0)
# Keep only rows/columns that have any strong correlation
filtered = corr.loc[mask.any(), mask.any()]
plt.figure(figsize=(10, 8))
sns.heatmap(filtered, annot=True, cmap='coolwarm', center=0)
plt.title("Filtered Correlation Matrix (|r| ≥ 0.7)")
plt.show()
correlations = watched.corr(numeric_only=True)['myRating'].sort_values(ascending=False)
print(correlations)
myRating           1.000000
LikedRatio         0.406282
Rating             0.403390
Rating_Z_Score     0.377289
Genre_Avg_Score    0.368167
FanRatio           0.251067
Watches            0.124441
Music              0.090222
Documentary        0.088221
MovieAge           0.084359
Crime              0.079276
Drama              0.078796
History            0.058523
Family             0.036469
TV Movie           0.036469
War                0.029892
Comedy             0.018246
Western           -0.022145
Thriller          -0.024424
Romance           -0.044717
Year              -0.084359
Action            -0.084375
Horror            -0.093094
Fantasy           -0.143420
Adventure         -0.167277
Science Fiction   -0.301138
Animation               NaN
Name: myRating, dtype: float64
This looks much better than before. There are still some high correlations, but none over 0.9, which is a good sign. Of the remaining features, several engineered ones (LikedRatio, Rating_Z_Score, Genre_Avg_Score, FanRatio) correlate more strongly with myRating than most of the preexisting features, which is promising. Likewise, there are no weak signals remaining.
# CREATING A CHECKPOINT DF
watched.to_csv('../data/v2/watched_2_mr.csv', index=False)
watchlist.to_csv('../data/v2/watchlist_2_mr.csv',index = False)
# Sanity check: confirm which columns remain in each dataframe
print("watched columns:", watched.columns.tolist())
print("")
print("watchlist columns:", watchlist.columns.tolist())
watched columns: ['Name', 'Year', 'myRating', 'Rating', 'Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Romance', 'Science Fiction', 'Thriller', 'TV Movie', 'War', 'Western', 'Watches', 'LikedRatio', 'FanRatio', 'MovieAge', 'Rating_Z_Score', 'Genre_Avg_Score']

watchlist columns: ['Date', 'Name', 'Year', 'Letterboxd URI', 'Director', 'Runtime', 'Rating', 'Genres', 'Themes', 'Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Mystery', 'Romance', 'Science Fiction', 'Thriller', 'TV Movie', 'War', 'Western', 'Watches', 'Likes', 'Fans', 'Total_ratings', 'LikedRatio', 'FanRatio', 'MovieAge', 'Rating_Z_Score', 'Genre_Avg_Score']
RANDOM FOREST¶
I chose to use a Random Forest as the first model for several reasons:
I was expecting mixed data types. There already are several (binary, engineered, discrete), and initially I had a different vision for how directors would be incorporated that ultimately didn't pan out (at least yet); Random Forests would have been well equipped to handle it. To summarize, I overestimated how many films I'd seen per director on average.
This is subjective data. Since it's based on personal ratings, I thought a Random Forest would give me a better chance at finding non-linear patterns than something like linear regression, which is more sensitive to outliers, and given the context there are many.
Collinearity and multicollinearity were my biggest concerns early on. I touched on it earlier, but there was a lot of overlap between Themes and Genres. Ultimately Themes never ended up being used, but initially they played a bigger role and I wanted a model that could handle them. The current version handles collinearity well thanks to the correlation analysis (which I didn't do early on), and as a result I'm now able to use linear regression with some success as well.
In summary: the model has evolved a lot since I started working on it. Initially a Random Forest made the most sense, but now other options are viable as well.
# setup
# Predicting myRating variable, everything else is features
target = 'myRating'
non_features = ['myRating', 'Name']
# for reference when you experiment with removing them
genres = ['Action', 'Adventure', 'Animation', 'Comedy', 'Crime',
'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music',
'Romance', 'Science Fiction', 'Thriller', 'TV Movie', 'War', 'Western']
# 'Name' is excluded from the features but kept in the dataframe so we can inspect results later
features = [col for col in watched.columns if col not in non_features]
print(features)
X = watched[features]
y = watched[target]
X.head(2)
Here's a visual of the features used in the model
# RANDOM FOREST MODEL
# split train/test (research variables)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 15)
# train the random forest
model = RandomForestRegressor(
n_estimators=200,
max_depth=10,
min_samples_split=5,
min_samples_leaf=2,
random_state=15
)
model.fit(X_train, y_train)
Evaluation¶
Metrics
# Evaluation
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # squared=False is removed in newer sklearn
print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
#-------------------K-FOLD CROSS VALIDATION
mae_scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print(f"Cross-validated MAE: {-np.mean(mae_scores):.2f}")
The Mean Absolute Error (MAE) is 0.75, which suggests the model's average margin of error is ±0.75 stars. Using 5-fold cross-validation this drops to 0.74, a more general indicator of performance on unseen data since it averages across different test sets, unlike the single-split MAE. In the context of myRating this sits almost exactly between half a star and a full star off. Given that this is a subjective model, that's not horrible. Ideally this would be closer to 0.5, but it's a good start.
Feature Importance & Hyperparameter Tuning
#------------------ VISUALIZE FEATURE IMPORTANCE
importances = model.feature_importances_
feature_importance = pd.Series(importances, index=X.columns).sort_values(ascending=False)
# Plot top 10 features
feature_importance.head(10).plot(kind="barh")
plt.title("Top Feature Importances")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
The engineered features dominate the feature importances. I'm happy about this because earlier versions weighed Rating and Year very heavily and nothing else, which felt rudimentary. On the other hand, Genre_Avg_Score still has room for improvement: it needs to be weighted by watch frequency per genre, and I'm curious whether a more accurate metric would improve performance.
# GRID SEARCH
param_grid = {
"n_estimators": [50, 100, 200],
"max_depth": [None, 10, 20],
"min_samples_split": [2, 5, 10],
"min_samples_leaf": [1, 2, 4],
}
# set up grid search
model = RandomForestRegressor(random_state=15)
grid_search = GridSearchCV(
estimator=model,
param_grid=param_grid,
scoring=make_scorer(mean_absolute_error, greater_is_better=False),
cv=5,
n_jobs=-1,
verbose=2
)
# run it
grid_search.fit(X, y)
# get the best model
print("Best parameters:", grid_search.best_params_)
print("Best MAE:", -grid_search.best_score_)
# testing again with best parameters
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # squared=False is removed in newer sklearn
importances = best_model.feature_importances_
feature_importance = pd.Series(importances, index=X.columns).sort_values(ascending=False)
# Plot top 10 features
feature_importance.head(10).plot(kind="barh")
plt.title("Top Feature Importances")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
#-------------------K-FOLD CROSS VALIDATION
mae_scores = cross_val_score(best_model, X, y, cv=5, scoring="neg_mean_absolute_error")
print(f"Cross-validated MAE: {-np.mean(mae_scores):.2f}")
Using Grid Search for hyperparameter tuning gets the CV MAE down to 0.72, which is somewhat improved! The feature importances are more or less the same, but notably Year and MovieAge have switched places and Thriller has replaced Action in terms of the binary genre features.
Predictions¶
# PREDICTIONS
X_watchlist = watchlist[features]
watchlist["predictedRating"] = best_model.predict(X_watchlist)
# Organize the output so it's cleaner to look at
cols = ['Name','predictedRating', 'Year', 'Director', 'Runtime', 'Rating']
watchlist = watchlist[cols]
watchlist = watchlist.sort_values(by="predictedRating", ascending=False)
watchlist["predictedRating"] = watchlist["predictedRating"].round(2)
watchlist.to_csv("../data/v2/RF_predictions.csv", index=False)
print("Predictions saved")
The model is run on watchlist, and a new variable, predictedRating, holds the model's predicted score for each film in place of myRating.
results = X_test.copy()
results['actual'] = y_test
results['predicted'] = y_pred
results['error'] = abs(results['actual'] - results['predicted'])
# Merge back full metadata, including 'Name'
results = results.merge(watched[['Name']], left_index=True, right_index=True)
worst = results.sort_values(by='error', ascending=False).head(10)
print(worst[['Name', 'actual', 'predicted', 'error']])
Of the set the model was tested on, these were the 10 films with the largest discrepancy between predictedRating and the actual rating (myRating). I'm not concerned with Buffalo '66 "only" getting a 4.04, since that's high enough to suggest it's good; recall that the Letterboxd Rating feature doesn't even extend to 5.0. Conversely, The Big Chill and A Serious Man, two films that I hate, get high 2's, but since Rating doesn't go below 2.0, I would still recognize these as low scores. Several of these can be chalked up to subjectivity (I don't like sci-fi, but I liked 10 Cloverfield Lane, which is anomalous), but River's Edge and Rumble Fish are off by far too much without an intuitive explanation. The remainder of the films in the set, and the difference between actual and predicted ratings, can be seen below:
results['error'] = results['predicted'] - results['actual']
vmin, vmax = -2.5, 2.5
custom_colorscale = [
[0.0, 'purple'],
[0.25, 'teal'],
[0.5, 'green'],
[0.75, 'teal'],
[1.0, 'purple']
]
fig = px.scatter(
results,
x='predicted',
y='actual',
color='error',
color_continuous_scale=custom_colorscale,
range_color=[vmin, vmax],
color_continuous_midpoint=0,
hover_name='Name',
title='Predicted vs. Actual Ratings (Green = Agreement, Purple = Disagreement)',
labels={
'predicted': 'Predicted Rating',
'actual': 'Actual Rating',
'error': 'Predicted - Actual'
},
opacity=0.7
)
fig.add_shape(
type='path',
path='M 0 0 L 5 0 L 5 5 Z',
fillcolor='rgba(255, 0, 0, 0.05)',
line=dict(width=0),
layer='below'
)
fig.add_shape(
type='path',
path='M 0 0 L 0 5 L 5 5 Z',
fillcolor='rgba(0, 255, 0, 0.05)',
line=dict(width=0),
layer='below'
)
fig.add_shape(
type='line',
x0=0, x1=5, y0=0, y1=5,
line=dict(color='gray', dash='dash')
)
fig.update_layout(
width=700,
height=700,
xaxis=dict(title='Predicted Rating', range=[0, 5], constrain='domain', scaleanchor='y', scaleratio=1, gridcolor='lightgray'),
yaxis=dict(title='Actual Rating', range=[0, 5], gridcolor='lightgray'),
template='plotly_white',
margin=dict(l=50, r=50, t=60, b=50),
coloraxis_colorbar=dict(title='Rating Difference')
)
fig.update_traces(
marker=dict(
size=10,
line=dict(width=1, color='black')
)
)
fig.show()
errors = y_test - y_pred
plt.figure(figsize=(8, 5))
sns.histplot(errors, bins=10, kde=True)
plt.axvline(0, color='red', linestyle='--')
plt.title("Distribution of Prediction Errors")
plt.xlabel("Actual - Predicted Rating")
plt.ylabel("Count")
plt.show()
Explaining the Model¶
I'll be using SHAP (SHapley Additive exPlanations) to explain how the model works. SHAP produces visuals that show which features matter globally and how each feature contributes to individual predictions.
# SHAP
explainer = shap.Explainer(best_model, X_train)
shap_values = explainer(X_test)
# global feature importance: a longer bar means a more influential feature, not necessarily positively or negatively
shap.plots.bar(shap_values)
This bar plot shows the global feature importance based on the mean absolute SHAP values across all films in the test set. The length of each bar shows how important that feature is, on average, to the predictions. It's unsurprising that Genre_Avg_Score is the most important feature, given how much it dominated the earlier feature-importance plots.
# color = feature value (red high, blue low); points to the right push the prediction up, points to the left push it down
shap.plots.beeswarm(shap_values)
This is a beeswarm plot. Each point represents a film in the test set. The y-axis lists the most relevant features, and the x-axis shows the degree to which the feature pushes the model's prediction: points to the right of the center line increase it, points to the left decrease it. The color of each dot represents the value of the feature itself; a high Rating is red and a low one is blue. For example, the furthest-left blue dot in the Rating row is a low-rated film, and that low Rating lowers its predictedRating by about 0.3 on its own.
# start at E[f(x)]; each bar shows how much each feature adds to or removes from the score
shap.plots.waterfall(shap_values[15])  # the film at index 15 in the test set
This is a waterfall plot for a single film in the test set (index 15 here). It's useful for exploring predictions at the individual-film level. E[f(x)] is the expected value of the prediction, and you follow each feature's contribution until you arrive at f(x), the predictedRating. In this example, the other features drive the score upward before being wiped out by Genre_Avg_Score. This is telling: Genre_Avg_Score has felt too "overpowered" throughout the project, which isn't good considering it's the worst-engineered feature. Some adjustments are definitely in order.
The following graph shows the predictedRating for each film in my watchlist. It's graphed against Rating to remain consistent with previous rating difference graphs but note that Rating is only one feature and this is included for visual purposes only.
watchlist['rating_diff'] = watchlist['predictedRating'] - watchlist['Rating']
# Set fixed min/max for consistent coloring
vmin, vmax = -2.5, 2.5 # adjust based on actual max range
# 5 pt color scale
custom_colorscale = [
[0.0, 'purple'], # strong negative diff
[0.25, 'teal'], # mild negative
[0.5, 'green'], # agreement
[0.75, 'teal'], # mild positive
[1.0, 'purple'] # strong positive diff
]
fig = px.scatter(
watchlist,
x='Rating',
y='predictedRating',
color='rating_diff',
color_continuous_scale=custom_colorscale,
range_color=[vmin, vmax],
color_continuous_midpoint=0,
hover_name='Name',
title='Predicted Ratings vs. Letterboxd Ratings (Green = Agreement, Purple = Disagreement)',
labels={
'Rating': 'Letterboxd Rating',
'predictedRating': 'Predicted Rating',
'rating_diff': 'Pred Rating - LB Rating'
},
opacity=0.7
)
fig.add_shape(
type='path',
path='M 0 0 L 5 0 L 5 5 Z', # bottom-right triangle
fillcolor='rgba(255, 0, 0, 0.05)',
line=dict(width=0),
layer='below'
)
fig.add_shape(
type='path',
path='M 0 0 L 0 5 L 5 5 Z', # top-left triangle
fillcolor='rgba(0, 255, 0, 0.05)',
line=dict(width=0),
layer='below'
)
# Reference line y = x. Films above it i like more than lb, films under i like less.
fig.add_shape(
type='line',
x0=0, x1=5, y0=0, y1=5,
line=dict(color='gray', dash='dash')
)
fig.update_layout(
width=700,
height=700,
xaxis=dict(range=[0, 5], constrain='domain', scaleanchor='y', scaleratio=1, gridcolor='lightgray'),
yaxis=dict(range=[0, 5], gridcolor='lightgray'),
template='plotly_white',
margin=dict(l=50, r=50, t=60, b=50),
coloraxis_colorbar=dict(title='Rating Difference')
)
fig.update_traces(
marker=dict(
size=10, # DOT SIZE
line=dict(width=1, color='black')
)
)
fig.show()
pred_mean = watchlist['predictedRating'].mean().round(2)
print(f"{pred_mean} is the mean `predictedRating`")
The results cluster in the 3-4 predictedRating range beneath the y = x line, a similar pattern to what was observed in the EDA. Likewise, the mean of myRating was 3.38 and the mean predictedRating is 3.43, so they're very close to each other.
Within the watchlist are two movies that I've already seen, but have been meaning to rewatch: No Country for Old Men and The Master.
No Country for Old Men is widely considered a classic. I've seen it twice and I'm not the biggest fan; I just think it's OK. It's on my watchlist because my friends all love it and want me to give it another chance. The Letterboxd Rating for this movie is 4.32, but my predictedRating is only 3.75. That's probably a little higher than I would give it, but the fact that the model didn't automatically score it above a 4.0 tells me it understands my preferences to some degree, considering this is a pretty unpopular opinion and a good example of subjectivity.
The Master is a movie by my favorite director, Paul Thomas Anderson. I like this movie. Ironically, I hate assigning numerical ratings to movies, but if I had to, I'd probably give it a 4. It's on my watchlist because I've been rewatching all of PTA's movies and this is the only one I haven't rewatched yet. The Rating for The Master is 4.0 and my predictedRating is 3.6. That's lower than I would probably give it, but since it's above 3.5, I'd interpret it as a movie I would generally like.
Improvements¶
The most obvious improvement would be to change how Genre_Avg_Score is calculated. It has too much of an effect on the predictedRating, which would be fine if it were a reliable metric. I'm not happy with how it's calculated and would like to improve it by weighting each genre's average by how many films of that genre I've watched.
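One way that weighting could work is shrinkage: pull a genre's average toward my overall mean rating when I've only watched a few films of that genre. A minimal sketch, assuming per-genre means and counts are already computed; the `weighted_genre_avg` function, the `k` parameter, and the toy numbers are all mine, not part of the project:

```python
import pandas as pd

# Hypothetical sketch: shrink each genre's average toward the global mean
# when few films of that genre have been watched. `k` controls how many
# watches a genre needs before its own average dominates.
def weighted_genre_avg(genre_means: pd.Series, genre_counts: pd.Series,
                       global_mean: float, k: int = 5) -> pd.Series:
    # Bayesian-style shrinkage: (n * genre_mean + k * global_mean) / (n + k)
    return (genre_counts * genre_means + k * global_mean) / (genre_counts + k)

# Toy example: a genre watched 20 times vs. a genre watched once.
means = pd.Series({"Horror": 4.0, "Western": 1.0})
counts = pd.Series({"Horror": 20, "Western": 1})
print(weighted_genre_avg(means, counts, global_mean=3.4, k=5))
# Horror stays near 4.0 (3.88); Western is pulled toward 3.4 (3.0).
```

This keeps a well-sampled genre's score intact while stopping a one-off watch from producing an extreme Genre_Avg_Score.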
Another potential improvement is experimenting with XGBoost. Gradient boosting is often a quick way to beat Random Forest performance, but based on past results I'm not sure my dataset is large enough to see a noticeable difference. It's worth exploring, though. On that same note, simply watching more movies would help by providing more data to train on.
Perhaps the biggest potential improvement would be to use a classification model instead. While reviewing the predictions, I realized that I don't care about the numerical value of predictedRating as much as I thought I would. In fact, I don't care about assigning numerical values to films at all. I found the rating process difficult. There were a few instances where I felt the need to score films above my enjoyment level just because I could appreciate things about them, the movie Stalker being a prime example: there are many things about it that are objectively great, but I don't enjoy the movie. It's painfully slow and not something I ever want to revisit, yet it feels like a disservice to rate it anything lower than a 3 because of what it is. Because of this, I was essentially using 3.5 as the rating threshold for movies that I like. Conversely, I found it difficult to decide what distinguishes a 0.5 from a 1.5; I'm not that concerned with the degree to which I'm going to dislike a movie. Being this subjective rating-wise doesn't do the model any favors. It's unrealistic to expect a model to predict that I think Pineapple Express is a 5-star movie, because it isn't one; I just like it. For my purposes, it would be more valuable for a model to determine whether I would like a movie than to rate it precisely.
For the purpose of my objective, I think reframing the project could produce different results. For my next attempt, I'll replace myRating with a variable like "Sentiment", change my rating scale from 0.5-5.0 to something like 'Liked/Neutral/Disliked', and then try either Logistic Regression or another Random Forest model to classify movies into those categories instead.
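The reframing could be as simple as binning the existing ratings. A minimal sketch, assuming the 3.5 "liked" threshold described above and a 2.5 cutoff for neutral-vs-disliked (the cutoffs, the `rating_to_sentiment` name, and the commented-out training lines are illustrative, not settled choices):

```python
# Hypothetical sketch of the reframed problem: bin myRating into a
# 'Sentiment' label, then classify instead of regressing.
def rating_to_sentiment(rating: float) -> str:
    if rating >= 3.5:        # the threshold I was already using informally
        return "Liked"
    elif rating >= 2.5:      # assumed cutoff between neutral and disliked
        return "Neutral"
    return "Disliked"

# With the watched dataframe and features from above, training might look like:
# y_sentiment = watched["myRating"].apply(rating_to_sentiment)
# clf = RandomForestClassifier(random_state=15)
# acc = cross_val_score(clf, watched[features], y_sentiment, cv=5, scoring="accuracy")

print(rating_to_sentiment(4.0), rating_to_sentiment(3.0), rating_to_sentiment(1.5))
# → Liked Neutral Disliked
```

This would also sidestep the Pineapple Express problem: a classifier only has to get "Liked" right, not the exact star count.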